From the Prosper website. Prosper was founded in 2005 as the first peer-to-peer lending marketplace in the United States. Since then, Prosper has facilitated more than $17 billion in loans to more than 1,030,000 people.
Through Prosper, people can invest in each other in a way that is financially and socially rewarding. Borrowers apply online for a fixed-rate, fixed-term loan between \$2,000 and \\$40,000. Individuals and institutions can invest in the loans and earn attractive returns. Prosper handles all loan servicing on behalf of the matched borrowers and investors.
Prosper Marketplace is backed by leading investors including Sequoia Capital, Francisco Partners, Institutional Venture Partners, and Credit Suisse NEXT Fund.
Prosper Loan Dataset (Last update 03/11/2014)
This dataset contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, and many others.
This data dictionary explains the variables in the data set.
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
# Setting up the general theme of charts and color palette to use
sns.set_theme(style='white', palette='Set2')
base_color = '#00334e'
# To display charts in the same notebook
%matplotlib inline
# Pandas diplay options
pd.options.display.max_columns = None
pd.options.display.max_rows = None
pd.set_option('display.float_format', lambda x: '%.5f' % x)
# First let's start by reading the data dictionary to understand each column of the data
data_dict = pd.read_excel('data/Prosper Loan Data - Variable Definitions.xlsx')
data_dict
# Loading the dataset and parsing the dates columns
date_columns = ['ListingCreationDate', 'ClosedDate', 'DateCreditPulled', 'FirstRecordedCreditLine', 'LoanOriginationDate']
loan = pd.read_csv('data/prosperLoanData.csv', parse_dates=date_columns)
# Display the shape of the data
loan.shape
# Display dataset summary statistics
loan.describe().T
# Disply df info
loan.info()
# Display the number of unique values for the string columns
loan.select_dtypes(include=['object']).nunique()
# Display any duplicated values if any and the number of duplicates
print(f'number of duplicated rows: {loan[loan.duplicated()].shape[0]}')
loan[loan.duplicated()]
# Number of loans for every year
loan.ListingCreationDate.dt.year.value_counts()
# Display the IncomeRange Value Counts
loan.IncomeRange.value_counts()
# Display Term Value Counts
loan.Term.value_counts()
# Display EmployementStatus Value Counts
loan.EmploymentStatus.value_counts()
# Display ProsperScore Value Counts
loan.ProsperScore.value_counts().sort_index()
# Display a sample from where ProsperScore 11
loan[loan.ProsperScore == 11].sample(10)
# Missing values from LoanStatus or BorrowerAPR
loan[['BorrowerAPR', 'LoanStatus']].info()
# Display the rows without APR rate
loan[loan['BorrowerAPR'].isnull()]
There are 113,937 observations for prosper loans with 81 features most features are numeric in type (61) float64(50), int64(11), bool(3), datetime64[ns](5), object(12)
Mainly interested in 2 features BorrowerAPR and LoanStatus to help me answer the following questions:
| Variable | Description |
|---|---|
| ListingCreationDate | The date the listing was created |
| Term | The length of the loan expressed in months |
| LoanStatus | The current status of the loan: Cancelled, Chargedoff, Completed, Current, Defaulted, FinalPaymentInProgress, PastDue. The PastDue status will be accompanied by a delinquency bucket |
| BorrowerAPR | The Borrower's Annual Percentage Rate (APR) for the loan |
| ProsperScore | A custom risk score built using historical Prosper data. The score ranges from 1-10, with 10 being the best, or lowest risk score. Applicable for loans originated after July 2009 |
| ListingCategory | The category of the listing that the borrower selected when posting their listing: 0 - Not Available, 1 - Debt Consolidation, 2 - Home Improvement, 3 - Business, 4 - Personal Loan, 5 - Student Use, 6 - Auto, 7- Other, 8 - Baby&Adoption, 9 - Boat, 10 - Cosmetic Procedure, 11 - Engagement Ring, 12 - Green Loans, 13 - Household Expenses, 14 - Large Purchases, 15 - Medical/Dental, 16 - Motorcycle, 17 - RV, 18 - Taxes, 19 - Vacation, 20 - Wedding Loans |
| BorrowerState | The two letter abbreviation of the state of the address of the borrower at the time the Listing was created |
| Occupation | The Occupation selected by the Borrower at the time they created the listing |
| EmploymentStatus | The employment status of the borrower at the time they posted the listing |
| EmploymentStatusDuration | The length in months of the employment status at the time the listing was created |
| IsBorrowerHomeowner | A Borrower will be classified as a homowner if they have a mortgage on their credit profile or provide documentation confirming they are a homeowner |
| CreditScoreRangeLower | The lower value representing the range of the borrower's credit score as provided by a consumer credit rating agency |
| CreditScoreRangeUpper | The upper value representing the range of the borrower's credit score as provided by a consumer credit rating agency |
| IncomeRange | The income range of the borrower at the time the listing was created |
| LoanCurrentDaysDelinquent | The number of days delinquent |
| LoanOriginalAmount | The origination amount of the loan |
Before I close this section let's do some data cleaning
LoanStates into (Past Due).# List of columns to use with the features of intrest
use_cols = ['ListingCreationDate',
'Term',
'LoanStatus',
'BorrowerAPR',
'ProsperScore',
'ListingCategory (numeric)',
'BorrowerState',
'Occupation',
'EmploymentStatus',
'EmploymentStatusDuration',
'IsBorrowerHomeowner',
'CreditScoreRangeLower',
'CreditScoreRangeUpper',
'IncomeRange',
'LoanCurrentDaysDelinquent',
'LoanOriginalAmount']
# Copy of the data frame to clean it
loan_clean = loan.copy()
# Use only the columns of intrest
loan_clean = loan_clean[use_cols]
# drop the data of 2005 and 2014
loan_clean = loan_clean.loc[~loan_clean.ListingCreationDate.dt.year.isin([2005, 2014])].reset_index().drop('index', axis=1)
# Drop rows with no APR values
loan_clean = loan_clean.loc[~loan_clean['BorrowerAPR'].isnull()].reset_index().drop('index', axis=1)
# Rename the ProsperRating (Alpha) and ListingCategory (numeric) columns
loan_clean = loan_clean.rename(columns={'ListingCategory (numeric)': 'ListingCategory'})
# Replace the numbers of ListingCategory to Categories
listing_category_dict = {0: 'Not Available',
1: 'Debt Consolidation',
2: 'Home Improvement',
3: 'Business',
4: 'Personal Loan',
5: 'Student Use',
6: 'Auto',
7: 'Other',
8: 'Baby&Adoption',
9: 'Boat',
10: 'Cosmetic Procedure',
11: 'Engagement Ring',
12: 'Green Loans',
13: 'Household Expenses',
14: 'Large Purchases',
15: 'Medical/Dental',
16: 'Motorcycle',
17: 'RV',
18: 'Taxes',
19: 'Vacation',
20: 'Wedding Loans'}
loan_clean.ListingCategory.replace(listing_category_dict, inplace=True)
li= ['$100,000+', '$75,000-99,999', '$50,000-74,999', '$25,000-49,999',
'$1-24,999', '$0', 'Not displayed', 'Not employed']
li.reverse()
li
# Dictionary of ordered categories
cat_dict = {'IncomeRange': ['Not employed', 'Not displayed', '$0', '$1-24,999',
'$25,000-49,999', '$50,000-74,999', '$75,000-99,999', '$100,000+'],
'ProsperScore': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
'Term': [12, 36, 60]}
# List of columns to convert
cat_cols = ['EmploymentStatus', 'LoanStatus', 'ListingCategory', 'BorrowerState', 'Occupation']
# Converting ordered categories columns
for col in cat_dict:
ordered_cat = pd.api.types.CategoricalDtype(ordered = True,
categories = cat_dict[col])
loan_clean[col] = loan_clean[col].astype(ordered_cat)
# Converting other categorical clomuns
for col in cat_cols:
loan_clean[col] = loan_clean[col].astype('category')
loan_clean['AvgCreditScore'] = (loan_clean.CreditScoreRangeLower + loan_clean.CreditScoreRangeUpper) /2
loan_clean.drop(['CreditScoreRangeLower', 'CreditScoreRangeUpper'], axis=1, inplace=True)
loan_clean.LoanStatus.cat.categories
loan_clean.LoanStatus.replace(to_replace=['Past Due (1-15 days)',
'Past Due (16-30 days)',
'Past Due (31-60 days)',
'Past Due (61-90 days)',
'Past Due (91-120 days)',
'Past Due (>120 days)'],
value='Past Due', inplace=True)
loan_clean.info()
# Save the Cleaned Dataset into csv file
loan_clean.to_csv('data/cleaned_prosper.csv', index=False)
# Ploting the distribution of BorrowerAPR
fig, ax = plt.subplots(figsize=(16, 6), ncols=2)
sns.histplot(loan_clean.BorrowerAPR, kde=False, color=base_color, binwidth=0.01, ax=ax[0])
sns.despine(left=True);
# Zooming from 0.05 to 0.4
sns.histplot(loan_clean.BorrowerAPR, kde=False, color=base_color, binwidth=0.01, ax=ax[1], binrange=[0.05, 0.4])
sns.despine(left=True);
loan_clean.query('BorrowerAPR < 0.01 | BorrowerAPR > 0.45').sort_values('BorrowerAPR')
loan_clean.BorrowerAPR.describe()
BorrowerAPR is approximatly normal with a mean of 22.26%, median of 21.47%, minmun rate of 0.65% and maximum of 51.23%. and most of the values ranges between 10% and 45%.¶# Ploting the distribution of LoanStatus
LoanStatus_order=loan_clean["LoanStatus"].value_counts().index
fig, ax = plt.subplots(figsize=(12, 6))
sns.countplot(data=loan_clean, x='LoanStatus', color=base_color, ax=ax, order=LoanStatus_order)
sns.despine(left=True);
LoanStatus: The most status is Current followd by completed, Chargedoff and Defaulted and the rest is less than 400 and Canceled loans are almost 0.¶# Listing the columns to plot
cols = ['ProsperScore', 'Term', 'EmploymentStatus', 'IncomeRange']
# Create 5 rows subplots
fig, ax = plt.subplots(figsize=(10, 20), nrows=4)
# Plotting loop
for i, col in enumerate(cols):
# If the column dtype is ordered category use it`s order if not order by highest value
if loan_clean[col].cat.ordered:
order = None
else:
order = loan_clean[col].value_counts().index
# Ploting
sns.countplot(data=loan_clean, x=col, order=order,color=base_color, ax=ax[i])
sns.despine(left=True)
# If the number of unique values more the 7 turn switch the axis to prevent the ticks from overlapping
if col == 'IncomeRange':
plt.xticks(rotation=90)
# Listing the columns to plot
cols = ['ListingCategory', 'BorrowerState', 'Occupation']
# Create 3 columns subplots
fig, ax = plt.subplots(figsize=(20, 20), ncols=3)
# Plotting loop
for i, col in enumerate(cols):
# If the column dtype is ordered category use it`s order if not order by highest value
if loan_clean[col].cat.ordered:
order = None
else:
order = loan_clean[col].value_counts().index
# Ploting
sns.countplot(data=loan_clean, y=col, order=order,color=base_color, ax=ax[i])
sns.despine(left=True)
# Creating plots for listing creation data by year and month
fig, ax = plt.subplots(figsize=(16, 6), ncols=2)
g = sns.countplot(x=loan_clean.ListingCreationDate.dt.year, color=base_color, ax=ax[0])
sns.despine(left=True)
g.set(xlabel='Loan Creation Year');
order = ['January', 'February', 'March', 'April', 'May', 'June',
'July', 'August', 'September', 'October', 'November', 'December']
g = sns.countplot(x=loan_clean.ListingCreationDate.dt.month_name(), order=order, color=base_color, ax=ax[1])
sns.despine(left=True)
g.set(xlabel='Loan Creation Month')
plt.xticks(rotation=45);
# Preparing for pie plot
isHomeOwner = loan_clean.IsBorrowerHomeowner.value_counts()
# Creating a pie chart for IsBorrowerHomeowner colum
plt.pie(isHomeOwner,
labels=['Home Owner', 'Not a Home Owner'],
startangle=90,
counterclock=False,
wedgeprops = {'width' : 0.5},
colors=['#00e560', '#e50020'])
plt.axis('square');
# Ploting the distribution of EmploymentStatusDuration
binsize = 20
bins = np.arange(0, loan_clean.EmploymentStatusDuration.max()+binsize, binsize)
fig, ax = plt.subplots(figsize=(16, 6), ncols=2)
ax[0].hist(loan_clean.EmploymentStatusDuration, color=base_color, bins=bins)
ax[0].spines['top'].set_visible(False)
ax[0].spines['right'].set_visible(False)
ax[0].spines['left'].set_visible(False)
ax[0].set_xlabel('Employment Status Duration');
# Ploting the log of EmploymentStatusDuration
log_bins = 10 ** np.arange(0, 3.0+0.1, 0.1)
ticks = [ 1, 3, 10, 30, 100, 300, 1000]
ax[1].hist(loan_clean.EmploymentStatusDuration, color=base_color, bins=log_bins)
ax[1].spines['top'].set_visible(False)
ax[1].spines['right'].set_visible(False)
ax[1].spines['left'].set_visible(False)
ax[1].set_xscale('log')
ax[1].set_xticks(ticks)
ax[1].set_xticklabels(ticks)
ax[1].set_xlabel('Employment Status Duration (log)');
# Ploting Credit score ranges lower and upper
binsize = 10
bins = np.arange(0, loan_clean.AvgCreditScore.max()+binsize, binsize)
# Create a 1x2 Plot
fig, ax = plt.subplots(figsize=(16, 5), ncols=2)
# Ploting AvgCreditScore
ax[0].hist(loan_clean.AvgCreditScore, color=base_color, bins=bins)
ax[0].spines['top'].set_visible(False)
ax[0].spines['right'].set_visible(False)
ax[0].spines['left'].set_visible(False)
ax[0].set_xlabel('Average Credit Score');
# Ploting AvgCreditScore
ax[1].hist(loan_clean.AvgCreditScore, color=base_color, bins=bins)
ax[1].spines['top'].set_visible(False)
ax[1].spines['right'].set_visible(False)
ax[1].spines['left'].set_visible(False)
ax[1].set_xlabel('Average Credit Score')
ax[1].set(xlim=[450, 850]);
# Ploting LoanCurrentDaysDelinquent
# Create a 2x2 Plot
fig, ax = plt.subplots(figsize=(20, 6), ncols=2)
# Ploting LoanCurrentDaysDelinquent
binsize_delinquent = 50
bins_delinquent = np.arange(0, loan_clean.LoanCurrentDaysDelinquent.max()+binsize_delinquent, binsize_delinquent)
ax[0].hist(loan_clean.LoanCurrentDaysDelinquent, color=base_color, bins=bins_delinquent)
ax[0].spines['top'].set_visible(False)
ax[0].spines['right'].set_visible(False)
ax[0].spines['left'].set_visible(False)
ax[0].set_xlabel('Loan Current Days Delinquent');
# Ploting LoanCurrentDaysDelinquent Zoomed in range from 1 to 300
binsize_delinquent_zoomed = 5
bins_delinquent_zoomed = np.arange(1, loan_clean.LoanCurrentDaysDelinquent.max()+binsize_delinquent_zoomed, binsize_delinquent_zoomed)
ax[1].hist(loan_clean.LoanCurrentDaysDelinquent, color=base_color, bins=bins_delinquent_zoomed)
ax[1].spines['top'].set_visible(False)
ax[1].spines['right'].set_visible(False)
ax[1].spines['left'].set_visible(False)
ax[1].set_xlabel('Loan Current Days Delinquent')
ax[1].set(xlim=[1, 300]);
# Ploting Loan Original Amount Distribution
plt.figure(figsize=(10, 5))
sns.histplot(loan_clean.LoanOriginalAmount, color=base_color, kde=False, binwidth=1000)
sns.despine(left=True)
plt.axvline(x=loan_clean.LoanOriginalAmount.mean(), color='#20fa30', linewidth=3) # Show the mean with a green virtical line
plt.axvline(x=loan_clean.LoanOriginalAmount.median(), color='#2030fa', linewidth=3); # Show the median with a blue virtical line
BorrowerAPR and LoanStatus.BorrowerAPR is roughly normal with a peak around 35% and didn't to perform any kind of transformations.LoanStatus: The most status is Current followd by completed, Chargedoff and Defaulted and the rest is less than 400 and Canceled loans are almost 0. Let's start by plotting the correlation coeffetient for the numeric variables in our data
# Listing the numeric variables
num_cols = ['BorrowerAPR', 'ProsperScore', 'EmploymentStatusDuration',
'AvgCreditScore', 'LoanCurrentDaysDelinquent', 'LoanOriginalAmount']
# Heatmap Plot
plt.figure(figsize=(8, 6))
sns.heatmap(loan_clean[num_cols].corr(),
annot=True,
fmt='.3f',
cmap='YlGnBu',
center=0,
linewidths=.5,
square=True);
# Sample for Pair Grid
loan_samp = loan_clean.sample(1000)
# Pair grid plot for numeric variables
g = sns.PairGrid(data=loan_samp, vars=num_cols)
g = g.map_diag(sns.histplot, kde=False, color=base_color)
g.map_offdiag(sns.scatterplot, color=base_color, alpha=0.4, s=10);
BorrowerAPR and LoanOrignalAmount, we may expect that the bigger the loan amount the less the interest rate.BorrowerAPR and ProsperScore we may expect that the bigger the score the less the interest rate.BorrowerAPR and AvgCreditScore we may expect that the bigger average score the less interest rate.# List the high correlated variables with BorrowerAPR to zoom into their scatter plot
apr_high_corr = ['ProsperScore', 'AvgCreditScore', 'LoanOriginalAmount']
fig, axis = plt.subplots(figsize=(16, 20), nrows=3, ncols=2, sharey=True)
for i, var in enumerate(apr_high_corr):
sns.regplot(data=loan_clean,
x=var,
y='BorrowerAPR',
color=base_color,
scatter_kws={'alpha':0.4, 's':5},
ax=axis[i, 0])
sns.despine(left=True);
sns.histplot(x=loan_clean[var].astype('float64'),
y=loan_clean.BorrowerAPR,
color=base_color,
cbar=True,
ax=axis[i, 1])
sns.despine(left=True);
loan_clean.select_dtypes(include=['category', 'bool']).describe()
loan_clean.select_dtypes(exclude=['category', 'bool']).describe(datetime_is_numeric=True)
# Listing the categoric variables
cat_cols = ['Term', 'LoanStatus', 'ProsperScore',
'EmploymentStatus', 'IsBorrowerHomeowner', 'IncomeRange']
# Creating 3x2 plot for the vars with unique values less than 15
fig1, ax = plt.subplots(figsize=(20, 20), ncols=2, nrows=3, constrained_layout=True) # Constrained_layout to avoid plots overlapping
# Flatten the axis into 1d numpy array
ax = ax.flatten()
# Plotting loop
for i, var in enumerate(cat_cols):
sns.boxplot(data=loan_clean, x=var, y='BorrowerAPR', color='#33b5e5', ax=ax[i])
sns.despine(left=True, ax=ax[i])
if var in ['LoanStatus', 'EmploymentStatus', 'IncomeRange']:
plt.setp(ax[i].get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")
# plotting the distribution for BorrowerAPR by ListingCategory
order = loan_clean.ListingCategory.value_counts().index
g = sns.FacetGrid(data = loan_clean,
col='ListingCategory',
col_wrap=5,
col_order=order,
sharey=False,
height=2,
aspect=1.5)
g.map(sns.histplot,
'BorrowerAPR',
kde=False,
color=base_color)
g.set_titles('{col_name}');
# plotting the distribution for BorrowerAPR by BorrowerState
order = loan_clean.BorrowerState.value_counts().index
g = sns.FacetGrid(data = loan_clean,
col='BorrowerState',
col_wrap=6,
col_order=order,
sharey=False,
height=2,
aspect=1.5)
g.map(sns.histplot,
'BorrowerAPR',
kde=False,
color=base_color)
g.set_titles('{col_name}');
# plotting the distribution for BorrowerAPR by Occupation
order = loan_clean.Occupation.value_counts().index
g = sns.FacetGrid(data = loan_clean,
col='Occupation',
col_wrap=7,
col_order=order,
sharey=False,
height=2,
aspect=1.5)
g.map(sns.histplot,
'BorrowerAPR',
kde=False,
color=base_color)
g.set_titles('{col_name}');
BorrowerAPR and the payment Term.BorrowerAPR is for Completed followed by Current LoanStatus.BorrowerAPR and ProsperScore is appearing here again.BorrowerAPR and EmploymentStatus as it tends to be lower for emplyed individuals than not employed with some cosiderations for retired and part time jobs to be the lowest.BorrowerAPR tend to be lower for Homeowners than if the individual doesn't own a house.BorrowerAPR tend to be high for not emploed individuals in IncomeRange with some considerations for people with 0 income.BorrowerAPR and ListingCategory some reasons tend to get lower rate than others but we will dig more in multivariate exploration to see if this effect because of the ListingCategory or there is another correlated variable that make this effect.BorrowerAPR and the BorrowerState and Occupation are not clear.# Listing the columns to plot
cols = ['LoanStatus', 'Term', 'EmploymentStatus', 'IsBorrowerHomeowner', 'IncomeRange']
# Create 5 rows subplots
fig, ax = plt.subplots(figsize=(16, 20), nrows=5, constrained_layout=True)
# Plotting loop
for i, col in enumerate(cols):
# If the column dtype is ordered category use it`s order if not order by highest value
try:
if loan_clean[col].cat.ordered:
order = None
else:
order = loan_clean[col].value_counts().index
except:
order = None
pass
# Ploting
sns.countplot(data=loan_clean, x=col, order=order, hue='ProsperScore', palette='Blues', ax=ax[i])
sns.despine(left=True)
ax[i].legend(loc=1, ncol=2, framealpha=.7)
# Prevent the ticks from overlapping for IncomeRange and LoanStatus
if col in ['IncomeRange', 'LoanStatus']:
plt.setp(ax[i].get_xticklabels(), rotation=30, ha="right", rotation_mode="anchor")
ProsperScore is roughly the same for other categories.# Ploting countplot for LoanStatus by IncomeRange
# Create 5 rows subplots
fig, ax = plt.subplots(figsize=(16, 5))
# Ploting
sns.countplot(data=loan_clean, x='LoanStatus', hue='IncomeRange', palette='RdBu', ax=ax)
sns.despine(left=True)
ax.legend(loc=1, ncol=2, framealpha=.7);
plt.xticks(rotation=90);
# LoanStatus vs IsBorrowerHomeOwner
plt.figure(figsize=(16, 5))
sns.countplot(data=loan_clean,
x='LoanStatus',
hue='IsBorrowerHomeowner',
palette=['#e50020', '#00e560'])
sns.despine(left=True)
plt.xticks(rotation=90);
# Preparing for HeatMap plot
df_loan_income = loan_clean.groupby(['LoanStatus', 'IncomeRange']).count()['ListingCreationDate'].reset_index()
df_loan_income.columns = ['LoanStatus', 'IncomeRange', 'Count']
df_loan_income = df_loan_income.pivot(index = 'LoanStatus', columns = 'IncomeRange', values = 'Count').fillna(0)
# Heatmap Plot
plt.figure(figsize=(12, 6))
sns.heatmap(df_loan_income,
annot=True,
fmt=',.0f',
cmap='rocket_r',
mask=(df_loan_income<0.5),
linewidths=.5,
square=True);
IncomeRange vary accross the LoanStatus IncomeRange for current loans which means that the policy of the website changed lately.# Preparing for HeatMap plot
df_loan_employment = loan_clean.groupby(['LoanStatus', 'EmploymentStatus']).count()['ListingCreationDate'].reset_index()
df_loan_employment.columns = ['LoanStatus', 'EmploymentStatus', 'Count']
df_loan_employment = df_loan_employment.pivot(index = 'LoanStatus', columns = 'EmploymentStatus', values = 'Count').fillna(0)
# Heatmap Plot
plt.figure(figsize=(12, 6))
sns.heatmap(df_loan_employment,
annot=True,
fmt=',.0f',
cmap='rocket_r',
mask=(df_loan_employment<0.5),
linewidths=.5,
square=True);
EmploymentStatus vary accross the LoanStatus EmploymentStatus for current loans which means that the policy of the website changed lately.# Creating a df for map plotting
df_map = loan_clean[['BorrowerState', 'LoanOriginalAmount', 'BorrowerAPR']].groupby('BorrowerState').agg({'BorrowerAPR':'mean', 'LoanOriginalAmount':'sum'}).reset_index()
df_map.columns = ['state', 'apr', 'amount']
df_map.head()
# Setting the figure for plotly map
fig = go.Figure(data=go.Choropleth(locations=df_map.state, # Spatial coordinates
z = df_map.apr.astype(float) * 100 , # Data to be color-coded
locationmode = 'USA-states', # set of locations match entries in `locations`
text=df_map.state,
colorscale = 'RdBu',
autocolorscale=False,
marker_line_color='darkgray',
marker_line_width=0.5,
colorbar_title = "APR %",
colorbar_tickprefix = '%',
hovertemplate = '<i>APR</i>: %{z:.2f}<extra>%{text}</extra>', # Hoovering Text
showlegend=False))
# Setting the layout
fig.update_layout(
hoverlabel_align = 'right',
title_text = 'Average BorrowerAPR by State',
geo_scope='usa', # limite map scope to USA
)
# Setting the figure for plotly map
fig = go.Figure(data=go.Choropleth(locations=df_map.state, # Spatial coordinates
z = df_map.amount.astype(float), # Data to be color-coded
locationmode = 'USA-states', # set of locations match entries in `locations`
text=df_map.state,
colorscale = 'Blues',
autocolorscale=False,
marker_line_color='darkgray',
marker_line_width=0.5,
colorbar_title = "USD",
colorbar_tickprefix = '$',
hovertemplate = '<i>Amount</i>: $%{z:,.0f}<extra>%{text}</extra>', # Hoovering Text
showlegend=False))
# Setting the layout
fig.update_layout(
hoverlabel_align = 'right',
title_text = 'Total original loan amount by State',
geo_scope='usa', # limite map scope to USA
)
LoanOriginalAmount, ProsperScore and AvgCreditScore are the most features that affect the BorrowerAPR as the larger they are the lower the interest rate is.LoanStatus is effected the most by IncomeRange and EmploymentStatus as most of defaulted loans are for people who were unemployed. AvgCreditScore and LoanOriginalAmount is positively correlated we can expect that the higher score the higher the loan amount the individual can borrow.LoanOriginalAmount was for individuals who are located in California state.def hist2dgrid(x, y, **kwargs):
""" Quick hack for creating heat maps with seaborn's PairGrid. """
palette = kwargs.pop('color')
bins_x = np.arange(0, 12, 1)
bins_y = np.arange(0, 0.6, 0.1)
plt.hist2d(x, y, bins = [bins_x, bins_y], cmap = palette, cmin = 0.5)
df_hist2d = loan_clean.query('EmploymentStatus != "Not available"').copy()
# Ploting heatmap for ProsperScore and BorrowerAPR by EmploymentStatus
g = sns.FacetGrid(data = df_hist2d,
col='EmploymentStatus',
col_wrap = 4,
height = 3)
g.map(hist2dgrid, 'ProsperScore', 'BorrowerAPR', color = 'inferno_r')
g.set_titles('{col_name}')
g.despine(left=True);
# Ploting heatmap for ProsperScore and BorrowerAPR by IsBorrowerHomeowner
g = sns.FacetGrid(data = df_hist2d,
col='IsBorrowerHomeowner',
col_wrap = 4,
height = 3)
g.map(hist2dgrid, 'ProsperScore', 'BorrowerAPR', color = 'inferno_r')
g.set_titles('{col_name}')
g.despine(left=True);
# Ploting heatmap for ProsperScore and BorrowerAPR by LoanStatus
g = sns.FacetGrid(data = df_hist2d,
col='LoanStatus',
col_wrap = 4,
height = 3)
g.map(hist2dgrid, 'ProsperScore', 'BorrowerAPR', color = 'inferno_r')
g.set_titles('{col_name}')
g.despine(left=True);
# Plotting the relation between ProsperScore, BorrowerAPR and EmployementStatus
fig, ax = plt.subplots(figsize=(20, 8))
sns.pointplot(data=loan_clean,
x='LoanStatus',
y='BorrowerAPR',
hue='EmploymentStatus',
linestyles="",
dodge=0.5,
palette='inferno_r')
sns.despine(left=True)
plt.legend(loc=2, ncol=8, framealpha=0.7)
plt.show();
# Creating a df for map plotting
df_scatter = loan_clean.copy()
df_scatter['year'] = df_scatter.ListingCreationDate.dt.year
df_scatter = df_scatter[['BorrowerState',
'year',
'LoanOriginalAmount',
'BorrowerAPR',
'ListingCreationDate']].groupby(['BorrowerState', 'year']).agg({'BorrowerAPR':'mean',
'LoanOriginalAmount':'mean',
'ListingCreationDate':'count'}).reset_index()
df_scatter.columns = ['state', 'year', 'AverageAPR', 'AverageLoanAmount', 'count']
# Setting the text for hovering over the state
df_scatter.head()
# Plotting animated scatter
fig = px.scatter(df_scatter,
x="AverageLoanAmount",
y="AverageAPR",
animation_frame="year",
animation_group="state",
size="count",
color="state",
size_max=50)
fig.layout.update(title_text="Average BorrowerAPR vs Average Orignal Amount by State (2006-2013)",
title_font_size=14,
showlegend=True,
transition= {'duration':20000000})
fig.show()